Abstract 


Within distributed fault-tolerant systems, the term force-fight is colloquially used to describe the level of 
command disagreement present at redundant actuation interfaces. This report details an investigation of 
force-fight using three distributed system case-study architectures. Each case study architecture is 
abstracted and formally modeled using the Symbolic Analysis Laboratory (SAL) tool chain from the 
Stanford Research Institute (SRI). We use the formal SAL models to produce k-induction based proofs of 
a bounded actuation agreement property. We also present a mathematically derived bound of redundant 
actuation agreement for sine-wave stimulus. The report documents our experiences and lessons learned 
developing the formal models and the associated proofs. 
1 Introduction 


The document has been generated under NASA Task Order NNL10AB32T. It presents the modeling and 
exploration of the control system case-study architectures presented in [1]. 

In this document we have constructed formal model abstractions of key system strategies related to 
redundancy management. We use the models to prove characteristics of the case-study architectures. In this 
initial work, a single property is selected for formal examination. This selected property, 
colloquially termed force-fight, denotes the level of command disagreement that exists across redundant 
actuation interfaces. 


1.1 Scope 

This work is based on the Phase 2 control system case-studies that are documented in [1]. Although the 
case studies embody control-system models, the focus of our work is not related to control theory. The 
focus of this work is the formal investigation of distributed-system redundancy management logic. The 
control element of the problem is included here only to enable the interaction of the distribution and 
replication management policies with the higher level requirements of the external control system. For 
this initial work, the behavior of the control law was abstracted out of the formal representation, to enable 
simpler bounds of agreement to be formally established. 

Full listings of the SAL and Matlab models presented herein are available at the NASA DASHlink 
site AFCS - Distributed Systems (https://c3.nasa.gov/dashlink/projects/79/). 


1.1.1 Background and Motivation 

During Phase 1 of this research, most of the system modeling and analysis activities were focused on 
modeling system communication infrastructures and their associated protocols. However, during the 
review of the asynchronous case study [2], we learned that many real-world systems neither built upon 
nor leveraged the layered fault-tolerant services prescribed by formal fault-tolerance theory. In place of 
structured, layered, fault-tolerant services, these systems implement application-specific, fault mitigation 
strategies derived from field-proven domain experience. In such systems, as illustrated in [1], the system 
fault-tolerant strategies are often dispersed throughout the system control-law implementation. This 
dispersal complicates incremental verification, as the system fault tolerance is coupled with the 
application’s behavior. Consequently, formal validation (i.e., formally proving the correctness and 
sufficiency of the system fault-tolerance) is also non-trivial. However, given the wide-spread proliferation 
of these techniques, we believe that developing a formal framework that enables the validation of such 
system strategies will be very beneficial. 

We hope that this analysis will yield more systematic review, and potentially automation, of some of 
the validation activities required for this class of systems. To this end, as part of the Year 3 efforts, we 
intend to explore the feasibility of test generation from the system formal model to support the current 
manually-generated design validation activities. This work supports improved completeness claims with 
respect to system-level validation activities. 

In addition, given that many aspects of this class of system design are based upon years of domain- 
centric experience, we hope that formally capturing the knowledge associated with these systems will 
offset the risks associated with retaining this critical knowledge as the current workforce retires. 

Finally, by contrasting the performance of the different case-study architectures, we further hope to 
develop some insights about the potential strengths and weaknesses of the theoretical 
fault-tolerance strategies and industrial pragmatic fault-tolerance approaches. 
1.2 Domain-Specific Architecture Evolution 

Our discussion above explained that the design philosophy of many real-world approaches to fault 
tolerance has evolved pragmatically as the systems have taken on increasing levels of authority and 
system responsibility. Osder [3] describes this evolution within the flight control system architectures as 
analog and later digital electronic technologies were introduced. As these system architectures evolved, 
the increasing dependence on digital hardware, and its specific failure models, needed to be mitigated, and 
domain-specific architectural techniques [4] were developed. These techniques were largely influenced by the 
voting and fault detection strategies used to select among multiple lanes of redundancy. Some early 
designs leveraged global synchronization to reduce the complexity of cross-lane voters. With system- 
level synchronization, the error tolerance of the voters can be easily calculated from the system’s 
precision performance. However, the potential brittleness and common-mode influence of the system 
synchronization service led others to develop asynchronous cross-channel voting strategies [4]. Further 
background descriptions of synchronous and asynchronous system architectures are given in [5] [6]. 

For commercial flight control, the asynchronous design strategy is most prevalent today. This strategy 
is an interesting choice, given the complexities of designing and validating such systems, which are 
complicated due to the inexact agreement across redundant lanes 1. However, it appears that modern 
flight control systems do not require exact agreement. Therefore, the approximate agreement properties 
possible with asynchronous system architectures are sufficient. On the positive side, the asynchronous 
architectures based on approximate agreement yield systems that claim to be remarkably fault-tolerant to 
communication loss; for example, channels may remain operational with up to 20% communication 
packet loss. Interestingly, in the systems that we analyzed, Byzantine fault-tolerance is not specifically 
addressed other than for strategies to isolate asymmetric faults, as detailed in [1]. Hence, this aspect will 
be a part of our research agenda. Of particular interest is the behavior of the system during the time 
window required for asymmetric failure detection 2. 


1 As outlined in [7], the design of the voting strategies and control used in asynchronous systems is complex and 
non-trivial. 

2 Up to 10 seconds of delay is required to confirm an asymmetric failure. 
2 Case-study Architecture Review 

The following sections present a summary of the architectures analyzed herein. Further details of the 
architectural mechanisms are given in [1]. 


2.1 Asynchronous Triplex High Integrity Control 



Figure 1 Asynchronous Three Channel Switched Ethernet Architecture 
(Diagram labels: Computation Modules; Actuation Sense Modules; Triplex Hardware Actuation; Dual Hardware Actuation) 

The first case-study architecture is illustrated in Figure 1. This system comprises three asynchronous 
computation modules (CMs) connected to four actuation and sense modules (ASMs). Each computation 
module communicates with the ASMs using a dedicated Ethernet network. The computation modules also 
communicate among themselves using the Ethernet networks. 

All computation is done using self-checking hardware, incorporating a command and monitor 
computation lane within each CM. The monitor lane performs independent computation of the control 
algorithms and continuously monitors the output of the command lane. For each successful comparison, 
the monitor lane authenticates the validity of the commanded output by updating the values of the 
independent command signature and command confirmation heartbeat that are embedded within each 
output message. The signature and heartbeat sequence 3 are validated by each ASM before the output 
message is used. A computational error by the command processor would result in an invalid signature or 
heartbeat value and the ASM would reject the message. To monitor the integrity of the Ethernet network 
transportation, the system also incorporates a wrap-back acknowledgement protocol. The ASMs also 
reflect an encoded function of each input message back to the sender for end-to-end integrity 
confirmation. The monitor lane of each control computer also monitors this reflected status of the 
previous command. If this reflected status is found to be erroneous, the monitor ceases authentication of 
the output command scheme and heartbeat. The lack of a valid signature and/or heartbeat signifies that 
the control channel is invalid. 
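
The heartbeat check that each ASM applies before using a message can be sketched in Python as follows. This is an illustrative sketch only: the class name HeartbeatChecker, the increment-by-one rule, the modulus of 256, and the latching of the invalid state are assumptions made for the example. The actual prescribed sequence and the signature check are described in [1] and are not modeled here.

```python
class HeartbeatChecker:
    """Hypothetical ASM-side check of a heartbeat sequence counter."""

    def __init__(self, modulus=256):
        self.modulus = modulus        # assumed counter width, illustrative only
        self.expected = None          # no expectation until the first message
        self.latched_invalid = False  # assumption: a channel stays invalid once it fails

    def accept(self, counter):
        """Return True if the message carrying this counter may be used."""
        if self.latched_invalid:
            return False
        if self.expected is not None and counter != self.expected:
            # The monitor lane stopped authenticating, or the message was corrupted.
            self.latched_invalid = True
            return False
        self.expected = (counter + 1) % self.modulus
        return True
```

With this sketch, a stalled or skipped counter causes the current message and all later messages from that channel to be rejected.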


3 For any message to be considered valid, a heartbeat sequence counter embedded within the message is required to 
increment in a prescribed sequence. 
The ASMs also use self-checking hardware with a command and monitor lane in each ASM that prevents 
failures in ASM processing from corrupting sensor input or causing hazardous hardware actuation. 

In each ASM, a hybrid, mid-value selection function is used to select between the computation 
channel output commands. This selection is a function of how many of the computational output 
commands that an ASM receives are valid, with validity determined by the reception passing some in-line 
syntax tests that do not involve comparison among the command inputs. The function is implemented as 
follows: 

• If all three command streams are valid, an ASM selects the mid-value of the three valid 
computation input streams. 

• If only two of the computation input command streams are valid, the ASM uses the previous mid- 
value selection to supplement the two remaining streams, substituting the previous mid-value 
selection in-place of the missing or invalid command stream. 

• If only one computation command input stream is valid the ASM uses this stream. 
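
The three selection rules above can be sketched in Python as follows. This is a minimal sketch, not the ASM implementation: the function name hybrid_mid_value_select and the use of None to mark an invalid stream are assumptions for illustration. Holding the previous selection when no stream is valid is not stated in the rules above; it follows the "coasts when no good INPUT is available" comment in the formal model of Section 4.

```python
from typing import Optional, Sequence


def hybrid_mid_value_select(commands: Sequence[Optional[float]],
                            previous: float) -> float:
    """Select an output from three command streams (None marks an invalid stream)."""
    valid = [c for c in commands if c is not None]
    if len(valid) == 3:
        # All streams valid: take the middle of the three values.
        return sorted(valid)[1]
    if len(valid) == 2:
        # One stream missing: the previous selection stands in for it.
        return sorted(valid + [previous])[1]
    if len(valid) == 1:
        # Only one stream left: use it directly.
        return valid[0]
    # No valid stream: hold the previous selection (coast).
    return previous
```

For example, with streams 1.0 and 3.0 valid and a previous selection of 2.0, the function returns 2.0, so the previous output bounds how far a two-stream selection can move in one step.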

All tasking and communication within the system is implemented using a timed-asynchronous 
model, i.e., each component independently executes a local periodic schedule of activity. The ASMs 
execute at the highest rate of the system, for example 80 Hz; whereas, the computations of the control 
computers are distributed across multiple rates, ranging from 80 to 1 Hz. Multiple ASMs operate 
cooperatively to drive the output actuation services, connected in dual and triplex configurations. The 
ASMs also process external and feedback sensor data and provide it to the CMs. 

The system incorporates a number of strategies to ensure that the control computers remain aligned 
with respect to the commanded state. These strategies include the following: 

• Internal integrator and discrete state equalization — where each of the control computers 
continuously adjusts its state towards a fault-tolerant, mid-value function of the values from all 
operating lanes (which translates to majority voting for discrete signals) 

• Communication asymmetry management — where a control computer or ASM that is confirmed 
to be asymmetrically communicating, (i.e., a system component that is communicating with only 
a subset of the other system components) is isolated from influencing the group 

For the initial investigation of actuation agreement, this document does not elaborate on these 
strategies; however, the details can be found in [1]. Our rationale is presented with the initial system 
modeling in Section 4. 



2.2 Asynchronous BRAIN-based Ethernet Architecture 



The second case-study architecture is depicted in Figure 2. This system comprises the same components 
as the first system, but in place of the three switched Ethernet networks, a single Ethernet-based Braided 
Ring Availability Integrity Network (BRAIN) is used to connect the system components. 

In this case-study architecture the asynchronous BRAIN 3.0 protocol is considered. This is a layered 
protocol that can be deployed on top of a standard Ethernet or a profiled Ethernet (e.g., ARINC 664) 
implementation. BRAIN 3.0 assumes that routing authentication and bandwidth fairness allocation is 
performed within the underlying Ethernet layer. For example, the underlying Ethernet layer can use fixed 
routing tables and configured bandwidth allocation. The BRAIN 3.0 protocol leverages these underlying 
properties to implement data integrity acceptance criteria that are a function of qualified, disjoint, data- 
distribution path mapping. That is, received messages are not accepted as valid unless multiple copies of 
the messages arrive from totally disjoint communication paths and the messages are bit-for-bit identical, 
with the disjoint communication paths being enforced by path mapping mechanisms in the underlying 
Ethernet layer. Using the enforced message routing strategy, we believe that the BRAIN 3.0 protocol will yield a 
dual fault-tolerant (assuming non-colluding faults 4) high-integrity message broadcast guarantee. A full 
summary of the BRAIN 3.0 protocol message routing and details of the data acceptance tests are given in 
[1]. 
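
The acceptance criterion described above can be illustrated with the following Python sketch. It is not the BRAIN 3.0 implementation: the function accept_message, the representation of a path as a set of hop labels, and the choice of two matching copies as the acceptance threshold are assumptions for illustration. In the protocol itself, path disjointness is enforced by the path mapping mechanisms of the underlying Ethernet layer rather than checked by the receiver.

```python
from typing import FrozenSet, List, Optional, Tuple

# A received copy: (labels of the hops it traversed, payload bytes).
# The hop labels are hypothetical; they stand in for the qualified path mapping.
Copy = Tuple[FrozenSet[str], bytes]


def accept_message(copies: List[Copy]) -> Optional[bytes]:
    """Return the payload if two copies arrived over disjoint paths and match exactly."""
    for i, (path_a, payload_a) in enumerate(copies):
        for path_b, payload_b in copies[i + 1:]:
            # Accept only bit-for-bit identical copies from fully disjoint paths.
            if path_a.isdisjoint(path_b) and payload_a == payload_b:
                return payload_a
    return None  # no qualifying pair: the message is not accepted
```

A single corrupted copy, or two identical copies that share a hop, therefore never yields an accepted message in this sketch.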

Note that the role of the network in this second architecture is more integral to the system redundancy 
management arguments than the network of the initial case-study architecture. In the initial architecture, 
an end-to-end wrap-back protocol was implemented above the network to detect network data corruption. 
The BRAIN-based system also incorporates a number of high-level strategies to mitigate asymmetric 
communication failure. 

In the BRAIN 3.0 architecture, the underlying communication system is intended to guarantee a 
Byzantine resilient data broadcast in the presence of up to and including two non-colluding faults. We 
believe that this property of guaranteed data broadcast consistency will greatly improve system 
performance while reducing the system complexity and overheads. This idea will be investigated as the 
two system architectures are modeled and compared. 

4 Colluding faults are faults that act in support of each other. 

Another area where the BRAIN 3.0 and the initial case-study architecture differ is the comparison of 
the command and monitor lane outputs. In the initial architecture, the control computer comparison is 
implemented in software, with the monitor implementing bounded comparison of the command lane 
output prior to authentication of the command transmission over the network. In the BRAIN 3.0-based 
architecture the command and monitor comparisons are performed with the network distribution 
function 5 . To produce congruent output, the COM and MON lanes of each self-checking pair rendezvous 
and synchronize using the dedicated link that connects them. Other than this synchronization between 
COM and MON, all other data flow of the BRAIN 3.0 architecture is asynchronous, with each pair 
executing a local periodic schedule of tasking and communication activity. 


5 For this scheme we assume that the output of the lanes is bit-for-bit identical. Should dissimilar processing 
hardware be employed, the scheme assumes fixed-point arithmetic processing to ensure bit-for-bit lane congruency. 
2.3 Synchronous Two-tier Network Architecture 



Figure 3 Two-tier Synchronous Network Case-study Architecture 


The final case-study architecture is depicted in Figure 3. This system comprises the same components as 
the first system, but in place of the three switched Ethernet networks, a hybrid, two-tier network 
architecture is deployed. The control computers are 
fully interconnected and synchronized using a Time-Triggered Ethernet [7] network backbone. We 
assume that the quality of synchronization achieved in such a configuration will yield a synchronization 
precision of 25 µs. 6 To communicate with the ASMs, each control computer implements a dedicated 
access bus 7 connection using a typical access bus protocol, for example TTP [8]. To maintain system 
synchrony, the TTP access bus connections are also synchronized to the master Time-Triggered Ethernet 
schedule and timeline. Hence, this final system is globally synchronous with all system tasking and 
communication coordinated in accordance with the global Time-Triggered Ethernet timeline. 

In this third architecture, a separate TTP network is dedicated to each control computer channel. 
Given this configuration, it is possible for asymmetric communication faults to manifest between the 
control computers and the ASMs. Therefore, we assumed that this third architecture deploys similar 
network management and asymmetric communication fault detection strategies as the first architecture. 
We further assumed that this synchronous architecture implements an end-to-end wrap-back protocol to 
mitigate network component integrity failures. 


6 A typical precision achieved in industrial configurations. 

7 The term "access bus" is used to denote the lower tier of a two-tier network. 
3 A Discussion of the Case-study Architectures 

3.1 Overview 


The case-study architectures introduced in the previous section present a number of different redundancy 
management policies. To facilitate a comparison of all three architectures, a single property is selected for 
formal examination. This selected property, colloquially termed force-fight, denotes the level of command 
disagreement that exists across redundant actuation interfaces. Where multiple ASMs couple to a shared 
actuation interface, it is important that they maintain command congruency, since any discordance in 
command output may manifest as opposing forces applied to the actuation surface. These forces contribute 
unwanted surface stress that can in turn result in premature surface degradation and/or aging 8. This is 
illustrated in Figure 4 below. 


Figure 4 Actuation Force-Fight Instrumentation 
(Diagram labels: Computation Modules; Actuation Sense Modules. Annotation: agreement is measured at the 
actuation interface, post ASM selection, as Abs(cmd1 - cmd2).) 


Note that the level of agreement maintained at the actuation interface is determined solely by the 
distributed architecture redundancy management policies (including the degree of synchronization the 
redundancy management policy uses). In all of the three case-study architectures, the quality of actuation 
agreement is largely influenced by the emergent properties of the hybrid mid-value-selection function 
implemented within the ASMs, given possible asynchronous behavior of the commanded output streams 
from the control channels operating in normal and faulted conditions. 

3.2 Assumed Failure Modes 

The first stage in any architecture analysis is to define the assumed failure modes of the system 
components and communication. At first glance, the self-checking mechanisms of the computation and 
ASM hardware would normally lead to a fail-silent failure model. However, since the network hardware 
in the first and third architectures is not self-checking, this assumption would be invalid. These 
architectures use a wrap-back based integrity check and isolation scheme that cannot contain all integrity 
violations; there exists an unlikely, but non-zero, probability of a network-induced corruption escaping 
the fault detection capability. Therefore, we assume that a single erroneous value may escape from the 
self-checking computer channel without detection. Persistent integrity errors are not assumed, since the 
encoded heartbeat protocol will cease command authentication on detection of the first error. 

8 This is particularly important with composite airframe materials. 

With respect to message distribution in the first and third architectures, we assume that some fault 
conditions may cause a compute channel to communicate asymmetrically with the ASMs. After review, 
we found that, in some systems, there is a significant fault detection lag in the systems' logic to mitigate 
such asymmetric communication. Therefore, under worst-case conditions, communication asymmetry 
may persist and contribute to output non-congruence before isolation takes place. We further assume that 
up to two compute channels may be faulty at the same time. 

For the second (BRAIN-based) architecture, we assume the claimed fault model of the BRAIN 3.0 
protocol. That is to say, faults are consistently observed by all the consuming components, and the 
network and computation functions are fail-silent with respect to integrity violation. Note that at the time 
of writing this report, this fault model had not been formally verified, although informal experiments 
based on model checking have demonstrated the assumed property. 





4 Formally Modeling Actuation Agreement Using SAL 

4.1 Formal Model Description 




Figure 5 Formal Model High Level Structure 

To make the initial analysis more tractable, we focused our initial modeling on the behavior of the output 
stage in isolation using the techniques proposed in [9]. From a high-level view, this model can be 
conceived as a system where the cross-channel state equalization functions result in perfect control- 
channel equalization. That is to say, we assume that there are no errors attributed to the internal state 
divergence of the control channels. Further, we simplified our initial analysis by analyzing the output 
behavior in an open-loop configuration to remove the complication of incorporating the continuous 
plant-model into the control feedback path 9. A high-level pictorial representation of the simplified model is 
shown in Figure 5. 

In this initial model, the controller can be viewed as a pass-through transfer function with a gain of 1. 
The dominant behavior within the model comprises the interaction of the asymmetric fault-injection with 
the asynchronous control execution and hybrid mid-value-selection logic of the ASMs. 

The initial formal abstraction comprises the synchronous composition of the following components: 

• The plant module produces a saw-tooth wave form and models the asynchronous sampling of 
this waveform by the three computer control channels. 

• The mvs module models the hybrid mid-value-selection function of the ASMs. 

• The agreementmonitor module comprises synchronous observer monitoring of key properties 
of interest. 

• The faultinjectionbus module connects the plant and mvs modules. It distributes the computed 
output from the plant to the mvs and injects value and/or omission faults on selected signal 
paths. 


9 As part of Year 3 of our research, we expect to integrate these details into the formal abstraction framework. 
4.1.1 Asynchronous Interaction 


In asynchronous systems modeling, time-dependent interaction is key and is necessary to understand the 
phase-dependent behavioral relationships. In the initial formal abstraction, this aspect of the system is 
explored using a bounded non-deterministic selection of the plant sampling point. It is captured in the 
asynchronous_plant_sampling module of the mvs.sal file. The sampling phase of the first control channel 
is initialized to be non-deterministically assigned within the waveform period. 

% tt1 is initially constrained to be at least dt and at most period 
tt1 IN { z : REAL | z >= dt AND z <= period}; 


The sampling phase of the two remaining channels is initialized to lie within one sample period of 
the value used for the first channel. 

% tt2 and tt3 are constrained to be within one interval of dt below tt1 
tt2 IN { z : REAL | z >= tt1 - dt AND z <= tt1 AND z >= 0 AND z <= period }; 
tt3 IN { z : REAL | z >= tt1 - dt AND z <= tt1 AND z >= 0 AND z <= period }; 

In every state transition, the sampling time of the first channel is increased by the sample period. 
However, since the plant waveform is only defined over a single period, the update of the 
channel sampling point is constrained to fold back into the plant waveform period once it reaches the 
interval within dt of the waveform period. 

tt1' = IF tt1 <= period - dt THEN tt1 + dt ELSE tt1 + dt - period ENDIF; 


At each state transition, the sampling points of the second and third channels are non-deterministically 
assigned within one sample period dt of the current tt1 value. 

tt2' IN { z : REAL | z >= tt1 AND z <= tt1 + dt}; 
tt3' IN { z : REAL | z >= tt1 AND z <= tt1 + dt}; 


Note that this arrangement does not maintain a consistent sampling interval for the second and third 
channels. However, we believe that this is not necessary for the investigation of the ASM 
mid-value-selection behavior. This assignment also forces the exploration of all possible phase relationships of tt2 
and tt3 with tt1. Hence, in the update of the sample time period, we do not need to introduce additional 
phase values to model phase drift. 

To generate sample values, the plant simply returns the value of the test waveform for each of the 
sample time points. Note that at the bounds of the waveform period, the sample times of channels 2 and 3 may lie 
beyond the scope of the sample waveform definition. To mitigate this, the sample times are simply 
wrapped back into the waveform space by either subtracting the period for times greater than the period 
boundary or adding the period for sample times less than zero. This same structure is used for all 
sample channels. 

yp1 = waveform(tt1); 
yp2 = IF tt2 <= period THEN waveform(tt2) ELSE waveform(tt2 - period) ENDIF; 
yp3 = IF tt3 <= period THEN waveform(tt3) ELSE waveform(tt3 - period) ENDIF; 

To verify the validity of the sampling assumptions, the following test lemma was added: 

test1: LEMMA asynchronous_plant_sampling |- G(tt1 >= 0 AND tt1 <= period); 
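
The sampling-time update and the range property stated by the test lemma can be exercised with a small Python simulation. This is a sketch under stated assumptions: the function names, the values dt = 0.125 and period = 1.0, and the use of random sampling are illustrative and are not taken from the SAL model. Unlike the model checker, which covers every non-deterministic choice, a random simulation only samples some of them, so it can expose a violation but cannot prove the property.

```python
import random


def step_tt1(tt1, dt, period):
    """Advance the first channel by dt, folding back into the waveform period."""
    return tt1 + dt if tt1 <= period - dt else tt1 + dt - period


def fold(t, period):
    """Wrap a sample time that falls outside [0, period] back into the period."""
    if t > period:
        return t - period
    if t < 0:
        return t + period
    return t


def simulate(steps, dt=0.125, period=1.0, seed=0):
    """Return a list of (tt1, tt2, tt3) sample times, one tuple per transition."""
    rng = random.Random(seed)
    tt1 = rng.uniform(dt, period)  # initial phase of channel 1
    trace = []
    for _ in range(steps):
        tt1 = step_tt1(tt1, dt, period)
        # Channels 2 and 3 sample anywhere within dt of channel 1.
        tt2 = fold(rng.uniform(tt1, tt1 + dt), period)
        tt3 = fold(rng.uniform(tt1, tt1 + dt), period)
        trace.append((tt1, tt2, tt3))
    return trace
```

Checking that every sample time in the returned trace lies in [0, period] mirrors the intent of the test lemma for the sampled runs.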
4.1.2 Modeling Synchronous Interaction 

To model the time-triggered case study, we need to modify the channel timing alignment. This 
modification is done by constraining all three channels to a defined synchronization precision that is 
specified using the additional parameter sync_precision. This specification is captured in the 
synchronous_plant_sampling module. 

% tt1 is initially constrained to be at least sync_precision and at most period 
tt1 IN { z : REAL | z >= sync_precision AND z <= period}; 

% tt2 and tt3 are constrained to be within one sync_precision interval below tt1 
tt2 IN { z : REAL | z >= tt1 - sync_precision AND z <= tt1}; 
tt3 IN { z : REAL | z >= tt1 - sync_precision AND z <= tt1}; 


In the model we assume an achieved synchronization precision of 25 µs, a typical value for industrial 
systems. For the model, the tighter agreement bound based on the synchronization precision was also 
added: 

sync_precision: REAL = 0.000025; 
sync_e: REAL = 1.0 * sync_precision * p2p; 

4.1.3 Fault-Injection 

The faultinjectionbus modules are responsible for the distribution of the values from the plant modules to 
the two instances of the mvs module. These modules also introduce erroneous value and signal omission 
faults. Two fault injection scenarios are captured using the faultinjectionbus_I023 and 
faultinjectionbus_byzantine_channel models. The structure of these modules is equivalent. The signals 
from the plant are input as ypn variables. These values are separately assigned to each of the output mvs 
channels as cn_xn values. The cn_bn flags are used to validate the channel data. When a flag is set to 
FALSE, the data from the channel is considered invalid and the channel is omissive. In 
faultinjectionbus_I023, one of the mvs client modules is subjected to inconsistent omission failure of one 
or two computer control channels via the non-deterministic assignment of the validity flags for channels 
two and three. 


faultinjectionbus_I023 : MODULE = 
BEGIN 
  INPUT 
    yp1, yp2, yp3: REAL 
  OUTPUT 
    c1_x1, c1_x2, c1_x3: REAL, 
    c2_x1, c2_x2, c2_x3: REAL, 
    c1_b1, c1_b2, c1_b3: BOOLEAN, 
    c2_b1, c2_b2, c2_b3: BOOLEAN 
  DEFINITION 
    % OUTPUT values are good 
    c1_x1 = yp1; c1_x2 = yp2; c1_x3 = yp3; 
    c2_x1 = yp1; c2_x2 = yp2; c2_x3 = yp3; 

    % First Channel of MVS gets all good status values 
    c1_b1 = TRUE; c1_b2 = TRUE; c1_b3 = TRUE; 

    % Second Channel of MVS: channels 2 and 3 are inconsistently omissive 
    c2_b1 = TRUE; 
    c2_b2 IN { TRUE, FALSE}; 
    c2_b3 IN { TRUE, FALSE}; 
END; 
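
The non-deterministic choices made by this inconsistent-omission bus can be enumerated explicitly. The following Python sketch lists every pair of views that the two mvs clients may be given for one set of plant samples. The function name inject_i023 and the dictionary layout are assumptions for illustration and do not appear in the SAL model.

```python
from itertools import product


def inject_i023(yp1, yp2, yp3):
    """Yield each (client1, client2) view allowed by the inconsistent-omission bus."""
    values = (yp1, yp2, yp3)  # the values themselves are always passed through unchanged
    for b2, b3 in product((True, False), repeat=2):
        # Client 1 always sees all three channels as valid.
        client1 = {"x": values, "b": (True, True, True)}
        # Client 2 may independently lose channel 2, channel 3, or both.
        client2 = {"x": values, "b": (True, b2, b3)}
        yield client1, client2
```

There are four cases per step, and in three of them the two clients hold different sets of valid inputs, which is the source of the disagreement that the bounded agreement property must cover.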


In the faultinjectionbus_byzantine_channel module, the Byzantine failure of one of the computer channels is 
modeled. These errors are coded to present inconsistent and/or erroneous values from one of the computer 
channels to both of the mvs clients. 


faultinjectionbus_byzantine_channel : MODULE = 
BEGIN 
  INPUT 
    yp1, yp2, yp3: REAL 
  OUTPUT 
    c1_x1, c1_x2, c1_x3: REAL, 
    c2_x1, c2_x2, c2_x3: REAL, 
    c1_b1, c1_b2, c1_b3: BOOLEAN, 
    c2_b1, c2_b2, c2_b3: BOOLEAN 
  DEFINITION 
    % Channels 1 and 3 have good values 
    c1_x1 = yp1; c1_x3 = yp3; 
    c2_x1 = yp1; c2_x3 = yp3; 

    % And Good Status indication 
    c1_b1 = TRUE; c1_b3 = TRUE; 
    c2_b1 = TRUE; c2_b3 = TRUE; 

    % Channel 2 is inconsistently omissive 
    c1_b2 IN { TRUE, FALSE}; 
    c2_b2 IN { TRUE, FALSE}; 

    % And Byzantine 
    c1_x2 IN { z : REAL | z >= 0 AND z <= p2p }; 
    c2_x2 IN { z : REAL | z >= 0 AND z <= p2p }; 
END; 


For the BRAIN 3.0 architecture, a symmetric fault manifestation of up to two of the computer control 
channels is assumed 10. This assumption is coded in the faultinjectionbus_brain3 module. At each cycle of 
execution, the validity of the computer control outputs from Channels 2 and 3 to the first channel of the mvs is 
non-deterministically selected from TRUE and FALSE (good or faulty). The same selected values of the Channel 2 
and 3 faults are then presented to the input of the second mvs channel, so both mvs channels observe the faults consistently. 


c2_b1' = TRUE; 
c1_b2' IN { TRUE, FALSE}; 
c1_b3' IN { TRUE, FALSE}; 
c2_b2' = c1_b2'; 
c2_b3' = c1_b3'; 


4.1.4 MVS Evaluation 


The mvs module calculates the mid-value selection output of the ASM. As described in [1], the 
selected output is a function of the number of input streams. When all three streams are valid, the mid 
value of the three inputs is used. When only two inputs are valid, the function selects using the two valid 
inputs and the previous mid-value selection as a substitute input for the missing stream, as illustrated in 
the code below. 


% 3 inputs, 3 valid bits, OUTPUT 1 value 
mvs : MODULE = 
BEGIN 
  INPUT 
    x1, x2, x3 : REAL, 
    b1, b2, b3 : BOOLEAN 
  OUTPUT 
    x : REAL 
  INITIALIZATION 
    x = 0 
  TRANSITION 
    % new mvs coasts when no good INPUT is available 
    x' = midval( 
      IF b1' THEN x1' ELSIF b2' AND NOT(b3') THEN x2' ELSIF b3' AND NOT(b2') THEN x3' ELSE x ENDIF, 
      IF b2' THEN x2' ELSIF b1' AND NOT(b3') THEN x1' ELSIF b3' AND NOT(b1') THEN x3' ELSE x ENDIF, 
      IF b3' THEN x3' ELSIF b1' AND NOT(b2') THEN x1' ELSIF b2' AND NOT(b1') THEN x2' ELSE x ENDIF); 
END; 


For the selected inputs, the module calls a function returning the mid-value selection. 


midval(y1: REAL, y2: REAL, y3: REAL): REAL =
  IF y1 <= y2 THEN
    (IF y2 <= y3 THEN y2 ELSIF y1 <= y3 THEN y3 ELSE y1 ENDIF)
  ELSE
    (IF y1 <= y3 THEN y1 ELSIF y2 <= y3 THEN y3 ELSE y2 ENDIF)
  ENDIF;
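The selection rules described above can be mirrored in a short Python sketch. This is only an illustration of the logic, not part of the SAL model; the helper names `select` and `mvs_step` are hypothetical and were introduced here for readability.

```python
def midval(y1: float, y2: float, y3: float) -> float:
    """Median of three values, following the nested IF/ELSIF structure of the SAL function."""
    if y1 <= y2:
        if y2 <= y3:
            return y2
        elif y1 <= y3:
            return y3
        else:
            return y1
    else:
        if y1 <= y3:
            return y1
        elif y2 <= y3:
            return y3
        else:
            return y2


def select(own_ok, own, a_ok, a, b_ok, b, prev):
    """One midval argument: own input if valid, else the sole valid peer, else the previous output."""
    if own_ok:
        return own
    if a_ok and not b_ok:
        return a
    if b_ok and not a_ok:
        return b
    return prev


def mvs_step(x, b, prev):
    """One mvs update from inputs x = (x1, x2, x3), validity flags b, and the previous output."""
    x1, x2, x3 = x
    b1, b2, b3 = b
    return midval(
        select(b1, x1, b2, x2, b3, x3, prev),
        select(b2, x2, b1, x1, b3, x3, prev),
        select(b3, x3, b1, x1, b2, x2, prev),
    )
```

With two valid inputs the previous output takes the place of the missing stream, with one valid input that input is passed through, and with no valid input the output coasts at its previous value.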


10 As discussed earlier, we emphasize that this property has not yet been formally verified.
4.1.5 The Agreement Monitor


The agreementmonitor implements a synchronous observer [9] that monitors key points of interest within 
the model. 

The obvious instrumentation is the value difference observer between the two mvs channels. The 
expected bound of the agreement is dependent on the case-study architectural policies. Hence, the 
agreement monitor module includes dedicated flags for each of the agreement thresholds. 

flag_async_bounded_agreement = (mvs_1_x - mvs_2_x <= error AND mvs_2_x - mvs_1_x <= error);
flag_sync_bounded_agreement = (mvs_1_x - mvs_2_x <= error AND mvs_2_x - mvs_1_x <= error);


4.1.6 Model Composition 

The modules described in the previous sections were composed synchronously to support the evaluation of
the case-study architectures.

• systemmonitor_I023_asynchronous: Represents the asynchronous system with inconsistent
omission failure of up to two compute channels.

• systemmonitor_byzantine_channel_asynchronous: The previous architecture with a Byzantine
failure of a single compute channel.

• systemmonitor_I023_synchronous: Represents the time-triggered synchronous system with
inconsistent omission failure of up to two compute channels.

• systemmonitor_brain3_asynchronous: Represents the asynchronous system with the
symmetric fault model of the BRAIN 3 architecture.


To further validate the abstraction, we composed additional system configurations that included a
transient failure of the sampled waveform. This transient violates the rate-of-change assumption
of the sawtooth waveform, so we wanted to check that the use of such a waveform would result in
counterexamples. These scenarios are captured in the systemmonitor_I023_asynchronous_wt module. The
modified sample waveform code is shown below.


waveform_wt(t : REAL, tr : REAL): REAL =
  IF t <= period/2 THEN t
  ELSIF t >= tr THEN 0
  ELSE 1.0 - ((t - 1.0))
  ENDIF;


The position of the transient was non-deterministically assigned to lie within the plant period. 


tr IN {z : REAL | z >=0 and z <= period}; 

% ttl is initially constrained to be greater dt and less than period 
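To see why the transient breaks the rate-of-change assumption, the following Python sketch evaluates the nominal and transient waveforms across one sampling interval. The values `PERIOD = 2.0` and `dt = 0.05` are assumptions chosen for illustration (the ramp-down branch `1.0 - (t - 1.0)` suggests a half period of 1.0); the function `waveform` is a hypothetical nominal counterpart without the transient.

```python
PERIOD = 2.0  # assumed plant period, so that period/2 = 1.0 as the ramp-down branch suggests


def waveform(t: float) -> float:
    """Nominal sawtooth (assumed): unit-slope ramp up, then unit-slope ramp down."""
    return t if t <= PERIOD / 2 else 1.0 - (t - 1.0)


def waveform_wt(t: float, tr: float) -> float:
    """Sawtooth with a transient: the output drops to 0 once t reaches tr on the ramp-down."""
    if t <= PERIOD / 2:
        return t
    elif t >= tr:
        return 0.0
    else:
        return 1.0 - (t - 1.0)


dt = 0.05   # assumed sampling skew
tr = 1.5    # one possible transient position within the period

nominal_step = abs(waveform(tr) - waveform(tr - dt))
transient_step = abs(waveform_wt(tr, tr) - waveform_wt(tr - dt, tr))
print(nominal_step, transient_step)
```

Under these assumptions the nominal waveform changes by about `dt` over the interval, while the transient waveform changes by roughly half the amplitude, which is the kind of excess the counterexamples are expected to expose.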
4.1.7 Proving Agreement Properties


We used the formal models of the study architectures to explore the fault-tolerance and agreement
properties of the case-study system. These initial experiments were performed using the sal-inf-bmc
model checker. Initially, the depth of the exploration was set to be larger than the waveform period. The
calculation of the required depth is dependent on the model structure. In this example, the periods of the
mvs, fcm and the sampled plant waveform are harmonically related; therefore, exploring to the depth of
the plant waveform is sufficient.

Following these informal experiments, we used the k-induction capability of the sal-inf-bmc model 
checker to prove agreement. Using this initial abstraction both theorems were proven to be true. 

Using these additional lemmas, we could prove the bounded agreement using a depth of k = 2. For the
asynchronous architecture with two omission faults, the level of agreement corresponded to the temporal
skew of the computer channel sampling. That is to say, the level of agreement was determined by the
maximum rate of change of the plant waveform and the corresponding maximum divergence that may
occur over the sampling period.

expected_error : REAL = 1.0 * dt * p2p
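As an informal cross-check of this bound, the Python sketch below measures the largest change of a triangular plant waveform over one sampling skew `dt` and compares it with `1.0 * dt * p2p`. The waveform shape, `period = 2.0`, `p2p = 1.0` and `dt = 0.05` are assumptions made for the illustration; the formal result comes from the SAL proof, not from this sampling.

```python
def sawtooth(t: float, period: float = 2.0, p2p: float = 1.0) -> float:
    """Triangular wave (assumed shape): rises to p2p at period/2, then falls back to 0."""
    t = t % period
    half = period / 2
    slope = p2p / half
    return slope * t if t <= half else p2p - slope * (t - half)


def max_skew_disagreement(dt: float, steps: int = 2000,
                          period: float = 2.0, p2p: float = 1.0) -> float:
    """Largest difference between two samples taken dt apart, scanned over one period."""
    worst = 0.0
    for k in range(steps):
        t = period * k / steps
        diff = abs(sawtooth(t + dt, period, p2p) - sawtooth(t, period, p2p))
        worst = max(worst, diff)
    return worst


dt = 0.05                       # assumed sampling skew
p2p = 1.0                       # assumed peak-to-peak amplitude
expected_error = 1.0 * dt * p2p
worst = max_skew_disagreement(dt)
print(worst, expected_error)
```

In this sketch the factor 1.0 plays the role of the waveform slope per unit amplitude, so the bound is simply slope times skew.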

For the asynchronous architecture, the Byzantine fault-scenario also was explored. Interestingly, 
under the Byzantine failure scenario, the level of bounded agreement was equivalent to the omission 
scenarios. 

For the synchronous architecture, the level of expected error was reduced to correspond to the 
synchronous system precision. Using k-induction, this level of expected behavior was found to be a 
correct bound. 

sync_e: REAL = 1.0 * sync_precision * p2p;


We repeated this process with the Byzantine channel present and once again did not observe a
violation of the agreement bound property.

Finally, for the BRAIN-based asynchronous architecture, we instrumented the agreement monitor to
monitor exact agreement (i.e., zero error). This property also was proven using an induction depth of
k = 1.

4.1.8 Additional Model Validation Experiments 

To further validate the asynchronous model, we lowered the threshold of bounded agreement to 0.99999
of the expected value. This threshold returned a counterexample, as expected. In addition, the models
with the transient disturbance injected into the plant waveform also returned counterexamples, because the
rate of input change under the transient scenarios exceeded the maximum assumed in deriving the skew bound.
4.2 Discussion of Initial Model and Findings


The initial formal investigation of the Phase 2 case-studies yielded a number of lessons and, in some
cases, lessons re-learned. In the asynchronous architecture, the bound of agreement at the actuation
interfaces is solely determined by the asynchronous plant sampling. With our simplified saw-tooth plant
waveform, this corresponded to the change in the plant waveform that occurs during one period
of the sampling task. Using the sal-inf-bmc model checker, we were able to prove this bound of
agreement for fault scenarios comprising one or two inconsistent omissive faults. Interestingly, the
presence of a single Byzantine fault did not degrade this level of agreement. Therefore, it may be argued
that this class of architecture is not vulnerable to Byzantine failure 11 . Although this finding was initially
surprising, it is in line with our definitions [10] of Byzantine faults and Byzantine failure:

• Byzantine fault: a fault presenting different symptoms to different observers. 

• Byzantine failure: the loss of a system service due to a Byzantine fault in systems that 
require consensus. 

Using the above, if the level of disagreement due to the asynchronous sampling is sufficient for
performance, then additional strategies for exact agreement are not required. This finding helps qualify
our findings from the asynchronous case-study of the first year [2]. From that study, we concluded that
the overhead required for exact agreement may be too expensive for practical use. This new finding
further helps to illuminate the differences between the asynchronous and synchronous design mindsets. If
the system can perform with such levels of inexact agreement, it is easier to understand why system
architects with an asynchronous system design preference resist any suggestion to increase the level of
channel coupling to achieve a tighter bound of agreement if it is not needed. Such strategies allow
them to avoid common-mode influence from such agreement services that can increase system brittleness.
That being said, as illustrated in [11], developing voting and fault-isolation strategies with inexact
agreement can be relatively complex, requiring extensive knowledge of the system dynamics. The
increase in channel coupling from cross-lane equalization also needs to be considered and analyzed
under normal and failure modes. 12 Consequently, validating the effectiveness of such strategies is also
non-trivial. 13 In addition, arguing platform fault-tolerant properties independent of the hosted application
is very difficult within an asynchronous architecture, since the plant dynamics and asynchronous tasking
rates are closely coupled to the fault-detection thresholds and performance.

For the time-triggered synchronous architecture, the platform fault-tolerant properties are simpler to
establish, since they are dominated by the system precision and less influenced by tasking rate and plant
dynamics. In our study, the mid-value-selection also constrained the influence of a single Byzantine fault
to be within the agreement bound supported by the synchronous precision. Obviously, the
synchronization services underpinning such a system must also be validated for Byzantine fault-tolerance.

Finally, it is interesting to note that the consistent broadcast guarantee 14 of the BRAIN may achieve
exact agreement in either synchronous or asynchronous operational modes. This is an interesting option
for such architectures. However, the impact of transient errors also needs to be assessed. The extended
hierarchical agreement services detailed in [12] may be one option to increase the transient robustness.


11 Alternatively it may be argued that the asynchronous sampling may itself be considered equivalent to a Byzantine 
fault. 

12 Given the complexity and application-specific nature of equalization schemes, such analysis may be more 
complex than that of a fault-tolerant synchronization service. 

13 Fortunately, in the case-study architecture the majority of the fault-detection is implemented at the source via the 
comparison of the Command and Monitor lanes. Hence in such systems, the fault detection threshold calculation 
may be simpler to establish. 

14 Yet to be formally verified 
4.3 Initial Model Checking Performance and Findings

The model checking performance was acceptable for all models, with the bounded model checker
completing proofs within a few seconds. We believe that the non-deterministic assignment of the initial
plant sampling phase-offsets facilitates the exploration of phase-related emergent behavior. In addition,
in this initial model, deriving a proof from the model was straightforward, since it did not require the
generation of any auxiliary lemmas.

However, in this initial model the synchronous composition of the mvs modules is a deficiency:
the synchronous abstraction may miss effects resulting from the asynchronous boundary between the mvs
modules and their respective tasking rates. In addition, although this initial model is sufficient for the
open-loop exploration, we are uncertain how to evolve this model to integrate the closed-loop control and
plant models. To ameliorate these shortcomings, we explore an alternative abstraction in the next
section.
4.4 An Alternative Timeout Automata Based Abstraction


To address the issues discussed in the previous section, we developed an alternative abstraction using
timeout automata [13]. We hope that this abstraction will enable the impact of the asynchronous
interaction of the mvs systems to be analyzed. We further hope that this revised abstraction will facilitate
the integration of closed-loop control and plant behavior into the model.

The initial timeout-automata based abstraction contains very similar components to those described 
in the previous section. Hence the details of the subcomponents are not elaborated in detail below. Instead 
we concentrate on the differences related to the capture of the asynchronous tasking using the timeout 
automata based framework. 

4.4.1 Clock Module 

A central component of the revised model is the clock module. This module is responsible for
incrementing the global time and therefore governs the progression of time. Each system
component provides an input to this module (via the _timeout signals). This input corresponds to the time
of the respective component's next timed action (i.e., the value of its local timeout). The clock module
evaluates the global array of timeout events and advances a global time variable to the lowest value in the
global timeout array.


% Clock module: advance time to min(fcm_timeout1, fcm_timeout2, fcm_timeout3, mvs_timeout1, mvs_timeout2)
%
clock: MODULE =
BEGIN
  INPUT
    fcm_timeout1, fcm_timeout2, fcm_timeout3: TIME,
    mvs_timeout1, mvs_timeout2: TIME
  OUTPUT
    time: TIME
  INITIALIZATION
    time = 0;
  TRANSITION
    [ time < fcm_timeout1 AND time < fcm_timeout2 AND time < fcm_timeout3
      AND time < mvs_timeout1 AND time < mvs_timeout2
      -->
        time' IN { t: TIME | t <= fcm_timeout1 AND t <= fcm_timeout2 AND t <= fcm_timeout3
                             AND t <= mvs_timeout1 AND t <= mvs_timeout2
                             AND (t = fcm_timeout1 OR t = fcm_timeout2 OR t = fcm_timeout3
                                  OR t = mvs_timeout1 OR t = mvs_timeout2) };
    ]
END;
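The guarded command above can be read as "if every timeout lies strictly in the future, jump to the earliest one". A minimal Python sketch of that reading follows; the function name `clock_step` is hypothetical and the sketch ignores everything else in the SAL module.

```python
from typing import Optional, Sequence


def clock_step(time: float, timeouts: Sequence[float]) -> Optional[float]:
    """Return the next global time, or None when the clock guard is not enabled.

    The guard requires time < every timeout. The set comprehension in the SAL
    transition (t <= every timeout and t equal to one of them) is the minimum.
    """
    if all(time < t for t in timeouts):
        return min(timeouts)
    return None


# Example: three fcm timeouts followed by two mvs timeouts (illustrative values).
print(clock_step(0.0, [0.03, 0.01, 0.04, 0.02, 0.05]))
```

When the guard is disabled, at least one component has reached its timeout and must take its discrete step before time can advance again.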


4.4.2 Source Module 


This second abstraction also introduces a source module to represent the stimulus of the system. This
enables improved modeling of the fcm sampling 15 . This module derives a period count from the time; the
period count multiplied by the plant period is subtracted from the time value before calling the waveform
function (because the waveform function is only defined for a single plant period).

%
% Source module: at time t, the output is x = waveform(t - k * period)
% where period * k <= t < period * (k + 1)
%
source: MODULE =
BEGIN
  INPUT time: TIME
  LOCAL period_counter: INTEGER
  OUTPUT x: REAL
  DEFINITION
    period_counter IN { n: INTEGER | n * plant_period <= time AND time < (n+1) * plant_period };
    x = waveform(time - period_counter * plant_period);
END;


waveform(t: REAL): REAL =
  IF t < 0 OR t > plant_period THEN 0
  ELSIF t <= plant_period/2 THEN A * t
  ELSE amplitude - A * (t - plant_period/2)
  ENDIF;


15 Note that this is a simplification from the real system, where the input sampling is performed in the ASM modules.
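The period folding performed by the source module can be sketched in Python as follows. The constants are assumptions for illustration: `PLANT_PERIOD = 2.0` follows the 2 s period quoted later in the text, while `AMPLITUDE = 1.0` and the slope `A = AMPLITUDE / (PLANT_PERIOD / 2)` are chosen here so that the two branches of the waveform meet at the half period.

```python
import math

PLANT_PERIOD = 2.0                    # seconds, as quoted later in the text
AMPLITUDE = 1.0                       # assumed peak value
A = AMPLITUDE / (PLANT_PERIOD / 2)    # assumed slope, makes the waveform continuous


def waveform(t: float) -> float:
    """Single-period triangular waveform, defined only for 0 <= t <= PLANT_PERIOD."""
    if t < 0 or t > PLANT_PERIOD:
        return 0.0
    if t <= PLANT_PERIOD / 2:
        return A * t
    return AMPLITUDE - A * (t - PLANT_PERIOD / 2)


def source(time: float) -> float:
    """Fold the global time into one period, then evaluate the waveform."""
    # period_counter is the integer n with n * period <= time < (n + 1) * period
    period_counter = math.floor(time / PLANT_PERIOD)
    return waveform(time - period_counter * PLANT_PERIOD)


print(source(0.5), source(2.5), source(1.5))
```

In the SAL model the period counter is characterized by a constraint rather than computed; `math.floor` is used here as the executable equivalent.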


4.4.3 FCM Module 

The fcm is very similar to the fcm of the initial model. It comprises a very simple state transition. When
the value of time equals the timeout value, the fcm updates the sample value y to the current value of the
source x. In this initial model y is initialized to 0 and timeout is initialized to the 1st value. Due to the
synchronous system composition (see section 4.1.6), the fcm module also defines an empty ELSE transition.


faulty_fcm: MODULE =
BEGIN
  INPUT
    time: TIME
  OUTPUT
    timeout: TIME,
    y: REAL
  INITIALIZATION
    timeout IN { t: TIME | 0 <= t AND t < fcm_period };
  TRANSITION
    [ time = timeout -->
        timeout' IN { t: TIME | time + epsilon <= t };
        y' IN { x: REAL | true };
    []
      ELSE -->
    ]
END;


Note: The initial value pre_y has been included to support the proof by induction. This is discussed
later with the proof.


4.4.4 MVS Module 


The mvs module is very similar to that of the previous model. When the time is equal to the timeout value, the
module updates the mvs output using the midval function. Similar to the initial abstraction, the mvs
calculation is a function of the number of valid inputs, with input validity being denoted by Boolean flags
bn for each input channel. The model also defines an empty ELSE transition. This is to support the
synchronous composition of this module with other system modules (see section 4.1.6).


mvs_period: TIME = 0.05;

midval(y1: REAL, y2: REAL, y3: REAL): REAL =
  IF y1 <= y2 THEN
    (IF y2 <= y3 THEN y2 ELSIF y1 <= y3 THEN y3 ELSE y1 ENDIF)
  ELSE
    (IF y1 <= y3 THEN y1 ELSIF y2 <= y3 THEN y3 ELSE y2 ENDIF)
  ENDIF;


mvs: MODULE =
BEGIN
  INPUT
    y1, y2, y3: REAL,
    b1, b2, b3: BOOLEAN,
    time: TIME
  OUTPUT
    timeout: TIME,
    mvs: REAL
  INITIALIZATION
    timeout IN { t: TIME | 0 <= t AND t < mvs_period };
    mvs = 0;
  TRANSITION
    [ time = timeout -->
        timeout' = time + mvs_period;
        mvs' = midval(
          IF b1' THEN y1' ELSIF b2' AND NOT(b3') THEN y2' ELSIF b3' AND NOT(b2') THEN y3' ELSE mvs ENDIF,
          IF b2' THEN y2' ELSIF b1' AND NOT(b3') THEN y1' ELSIF b3' AND NOT(b1') THEN y3' ELSE mvs ENDIF,
          IF b3' THEN y3' ELSIF b1' AND NOT(b2') THEN y1' ELSIF b2' AND NOT(b1') THEN y2' ELSE mvs ENDIF);
    []
      ELSE -->
    ]
END;





4.4.5 Fault Injection 


The model also incorporates fault injection. However, due to the difficulties encountered with the
agreement proof (see later discussion), the faults are set to be inactive in the initial experiments.


% Fault model: sampler 2 is faulty so c1_b2 and c2_b2 can be true or false
%
fault_injection: MODULE =
BEGIN
  OUTPUT
    c1_b1, c1_b2, c1_b3: BOOLEAN,
    c2_b1, c2_b2, c2_b3: BOOLEAN
  DEFINITION
    c1_b1 = true;
    c1_b2 IN { true, false };
    c1_b3 = true;
    c2_b1 = true;
    c2_b2 IN { true, false };
    c2_b3 = true;
END;



4.4.6 System Composition 

The system composition is depicted below. The mvs and fcm modules are first synchronously composed 
into a system. This system is then asynchronously composed with the synchronous composition of the 
clock and source modules. It is crucial to compose the clock module asynchronously with the system. 
Otherwise, there would be a clear deadlock. 


% Full system:
% - synchronous composition of these components
%
system: MODULE =
  (RENAME timeout TO fcm_timeout1, y TO y1, pre_y TO pre_y1 IN fcm)
  || (RENAME timeout TO fcm_timeout2, y TO y2 IN faulty_fcm)
  || (RENAME timeout TO fcm_timeout3, y TO y3, pre_y TO pre_y3 IN fcm)
  || fault_injection
  || (RENAME b1 TO c1_b1, b2 TO c1_b2, b3 TO c1_b3, timeout TO mvs_timeout1, mvs TO mvs1 IN mvs)
  || (RENAME b1 TO c2_b1, b2 TO c2_b2, b3 TO c2_b3, timeout TO mvs_timeout2, mvs TO mvs2 IN mvs);


With this composition, two types of transitions are taken alternately:

• 'transitions in the physical model (clock + source)'

• 'transitions in the system (samplers + voters)'.

In the first type of transition, time advances and x is updated. In the second type of transition, time
and x are fixed but at least one of the samplers or voters (fcm/mvs modules) makes a discrete step. On every
transition of clock, 'time' is increased, whereas on every transition of system, 'time' stays unchanged.
Therefore the clock and system cannot be composed synchronously. However, the clock and source can
be composed synchronously, given that the output variable of the source module changes when 'time'
changes.
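The alternation between the two transition types can be illustrated with a small event loop in Python. This is only a sketch of the scheduling pattern under assumed values: `fcm_period = 0.1` is a hypothetical choice (the text does not state it here), the fault-free update `timeout' = time + period` is assumed for every component, and sampling and voting are omitted.

```python
def run(fcm_timeouts, mvs_timeouts, fcm_period=0.1, mvs_period=0.05, steps=40):
    """Alternate clock and system transitions; return a trace of (kind, time) pairs."""
    time = 0.0
    timeouts = {f"fcm{i + 1}": t for i, t in enumerate(fcm_timeouts)}
    timeouts.update({f"mvs{i + 1}": t for i, t in enumerate(mvs_timeouts)})
    periods = {name: (fcm_period if name.startswith("fcm") else mvs_period)
               for name in timeouts}
    trace = []
    for _ in range(steps):
        if all(time < t for t in timeouts.values()):
            # Clock transition: time jumps to the earliest pending timeout.
            time = min(timeouts.values())
            trace.append(("clock", time))
        else:
            # System transition: time is fixed; every expired component
            # takes its discrete step and schedules its next timeout.
            for name in list(timeouts):
                if timeouts[name] == time:
                    timeouts[name] = time + periods[name]
            trace.append(("system", time))
    return trace


trace = run([0.01, 0.02, 0.03], [0.015, 0.04])
print(trace[:6])
```

Because a clock step always lands exactly on a timeout, and a system step always moves the expired timeouts strictly into the future, the two kinds of step strictly alternate in this sketch.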

4.4.7 Investigating and Proving the System Agreement Properties

This second model presented many more difficulties than the coarse abstraction of the first model. Firstly,
given that the initial phase of the plant waveform is set to 0 in this second abstraction, the bounded model
checking experiments require a significant depth of exploration. The required depth is calculated by dividing
the plant waveform period (2 s) by the shortest sampling period (0.05 s), i.e., a value of 40. However,
since the mixed asynchronous/synchronous composition of the system components requires two steps to
increase the value of the clock, this value needs to be doubled to 80. To cross-check this value, a
simple time-based lemma was added to the model to validate time against the depth of model exploration.

t: LEMMA full |- G(time < plant_period);
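The depth calculation can be written out as a short worked example in Python. The variable names are hypothetical; the numeric values are those quoted in the text.

```python
plant_period = 2.0           # seconds, plant waveform period from the text
min_task_period = 0.05       # seconds, shortest sampling period from the text
steps_per_time_advance = 2   # one clock transition plus one system transition

# Number of distinct time advances needed to cover one plant period.
time_advances = round(plant_period / min_task_period)

# Each time advance costs two transitions in the composed model.
depth = time_advances * steps_per_time_advance
print(time_advances, depth)
```

This only reproduces the arithmetic; as the associated footnote on harmonic periods notes, a different relationship among the periods could require a larger depth.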


However, given this required depth of examination, the bounded model checking experiments were
relatively slow, requiring approximately one hour of compute time on a high performance processor.
Secondly, a significant effort was required to produce the auxiliary lemmas to support a k-induction based
proof of the agreement (even under no-fault scenarios).

The first set of invariants constrains the value of time and the timeout parameters. The value of time is
constrained to be no greater than the current timeout, and the timeout is, in turn, bounded to be no greater
than the current time plus the respective period. These invariants are provable by an induction depth of 1.


%
fcm_timeout_bounds1: LEMMA full |- G(time <= fcm_timeout1 AND fcm_timeout1 <= time + fcm_period);
fcm_timeout_bounds2: LEMMA full |- G(time <= fcm_timeout2 AND fcm_timeout2 <= time + fcm_period);
fcm_timeout_bounds3: LEMMA full |- G(time <= fcm_timeout3 AND fcm_timeout3 <= time + fcm_period);
mvs_timeout_bounds1: LEMMA full |- G(time <= mvs_timeout1 AND mvs_timeout1 <= time + mvs_period);
mvs_timeout_bounds2: LEMMA full |- G(time <= mvs_timeout2 AND mvs_timeout2 <= time + mvs_period);


The second set of invariants constrains the fcm sample values (y). These are constrained to be within 
a fixed delta of stimulus source waveform output (x), where the size of the delta is derived from the fcm 
sampling period and the stimulus amplitude. 


%
% Provable by induction at depth 1
%
sampling_error1: LEMMA
  full |- G(x - y1 <= A * (time - (fcm_timeout1 - fcm_period)) AND y1 - x <= A * (time - (fcm_timeout1 - fcm_period)));

sampling_error2: LEMMA
  full |- G(x - y2 <= A * (time - (fcm_timeout2 - fcm_period)) AND y2 - x <= A * (time - (fcm_timeout2 - fcm_period)));

sampling_error3: LEMMA
  full |- G(x - y3 <= A * (time - (fcm_timeout3 - fcm_period)) AND y3 - x <= A * (time - (fcm_timeout3 - fcm_period)));
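The intuition behind these lemmas is that a sample taken at time fcm_timeout - fcm_period can drift from the live source by at most the waveform slope A times the elapsed time. The Python sketch below spot-checks that inequality by random sampling. It is not a proof, and the constants (`PLANT_PERIOD = 2.0`, `AMPLITUDE = 1.0`, `FCM_PERIOD = 0.1`, and the slope `A`) are assumptions made for the illustration.

```python
import math
import random

PLANT_PERIOD = 2.0                    # seconds, as quoted in the text
AMPLITUDE = 1.0                       # assumed peak value
A = AMPLITUDE / (PLANT_PERIOD / 2)    # assumed slope of the triangular waveform
FCM_PERIOD = 0.1                      # assumed fcm sampling period


def source(time: float) -> float:
    """Periodic triangular stimulus, folded into a single plant period."""
    t = time - math.floor(time / PLANT_PERIOD) * PLANT_PERIOD
    if t <= PLANT_PERIOD / 2:
        return A * t
    return AMPLITUDE - A * (t - PLANT_PERIOD / 2)


def check_sampling_error(trials: int = 1000, seed: int = 0) -> bool:
    """Check |x - y| <= A * (time - sample_time) for random sample and observation times."""
    rng = random.Random(seed)
    for _ in range(trials):
        sample_time = rng.uniform(0.0, 10.0)            # plays the role of fcm_timeout - fcm_period
        time = sample_time + rng.uniform(0.0, FCM_PERIOD)
        y = source(sample_time)                         # held sample
        x = source(time)                                # live source value
        if abs(x - y) > A * (time - sample_time) + 1e-9:
            return False
    return True


print(check_sampling_error())
```

The check relies on the assumed waveform being continuous with slope magnitude A everywhere, including across the period boundary.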


The next set of lemmas constrains the behavior of two successive samples; that is, a previous sample is
constrained to be within a fixed delta of the current sample, where the delta once again is a function of the
waveform amplitude and the sampling period.


%
% Bound on the difference between two successive samples
% - provable by induction at depth 1, using sampling_error<i> as a lemma
%
pre_sampling_delta1: LEMMA
  full |- G(y1 - pre_y1 <= A * fcm_period AND pre_y1 - y1 <= A * fcm_period);
pre_sampling_delta2: LEMMA
  full |- G(y2 - pre_y2 <= A * fcm_period AND pre_y2 - y2 <= A * fcm_period);
pre_sampling_delta3: LEMMA
  full |- G(y3 - pre_y3 <= A * fcm_period AND pre_y3 - y3 <= A * fcm_period);


The set of auxiliary lemmas below constrains the sampling differences across the fcm channels. These
are pair-wise invariants that constrain the channel samples (y) to be within a fixed delta derived from the
waveform amplitude and sampling rate. The invariants can be proven using the respective
channels' sampling error lemmas.


%
% Bound on the difference between two sampling channels
% - to prove sampling_delta(i,j) use induction at depth 1, with sampling_error<i> and sampling_error<j> as lemmas
%
sampling_delta12: LEMMA
  full |- G(fcm_timeout1 >= fcm_timeout2 => y1 - y2 <= A * (fcm_timeout1 - fcm_timeout2)
                                        AND y2 - y1 <= A * (fcm_timeout1 - fcm_timeout2));

sampling_delta21: LEMMA
  full |- G(fcm_timeout2 >= fcm_timeout1 => y1 - y2 <= A * (fcm_timeout2 - fcm_timeout1)
                                        AND y2 - y1 <= A * (fcm_timeout2 - fcm_timeout1));

sampling_delta13: LEMMA
  full |- G(fcm_timeout1 >= fcm_timeout3 => y1 - y3 <= A * (fcm_timeout1 - fcm_timeout3)
                                        AND y3 - y1 <= A * (fcm_timeout1 - fcm_timeout3));

sampling_delta31: LEMMA
  full |- G(fcm_timeout3 >= fcm_timeout1 => y1 - y3 <= A * (fcm_timeout3 - fcm_timeout1)
                                        AND y3 - y1 <= A * (fcm_timeout3 - fcm_timeout1));

sampling_delta23: LEMMA
  full |- G(fcm_timeout2 >= fcm_timeout3 => y2 - y3 <= A * (fcm_timeout2 - fcm_timeout3)
                                        AND y3 - y2 <= A * (fcm_timeout2 - fcm_timeout3));

sampling_delta32: LEMMA
  full |- G(fcm_timeout3 >= fcm_timeout2 => y2 - y3 <= A * (fcm_timeout3 - fcm_timeout2)
                                        AND y3 - y2 <= A * (fcm_timeout3 - fcm_timeout2));


The final two lemmas constrain the behavior of the mvs selection under no-fault scenarios. Each mvs output is
defined to be the mid-value function of, for each channel, either the current sample or the previous sampled
value. Each is provable at induction depth 1 using the fcm_timeout auxiliary lemmas introduced above.


mvs_invar1: LEMMA full |- G(mvs1
  = midval(IF fcm_timeout1 - fcm_period <= mvs_timeout1 - mvs_period THEN y1 ELSE pre_y1 ENDIF,
           IF fcm_timeout2 - fcm_period <= mvs_timeout1 - mvs_period THEN y2 ELSE pre_y2 ENDIF,
           IF fcm_timeout3 - fcm_period <= mvs_timeout1 - mvs_period THEN y3 ELSE pre_y3 ENDIF));


mvs_invar2: LEMMA full |- G(mvs2
  = midval(IF fcm_timeout1 - fcm_period <= mvs_timeout2 - mvs_period THEN y1 ELSE pre_y1 ENDIF,
           IF fcm_timeout2 - fcm_period <= mvs_timeout2 - mvs_period THEN y2 ELSE pre_y2 ENDIF,
           IF fcm_timeout3 - fcm_period <= mvs_timeout2 - mvs_period THEN y3 ELSE pre_y3 ENDIF));


Using the mvs_invar1, mvs_invar2, sampling_delta12, sampling_delta13, sampling_delta23,
sampling_delta21, sampling_delta31, sampling_delta32, pre_sampling_delta1, pre_sampling_delta2, and
pre_sampling_delta3 auxiliary lemmas, the agreement property can be proved at an induction depth of 1.


agreement: THEOREM full |- G(mvs1 - mvs2 <= error AND mvs2 - mvs1 <= error);


4.5 Investigating Faults with the Timeout Automata Based Abstraction 

The previous model facilitates the proof of agreement in the fault-free case. However, the proof does not
hold under fault conditions. In addition, the depth of required exploration in the previous model precludes
practical interactive model checking, since each run requires over 2 hours of execution time. Therefore,
the model was modified to incorporate a non-deterministically selected initial plant waveform phase
offset. To do this, an additional global start_phase_type and an associated unbounded start_phase
constant were added to the context.

start_phase_type : TYPE = { t: TIME | 0 <= t AND t < plant_period - fcm_period }; 

start_phase : start_phase_type ; 


The initialization values of time in the clock, mvs, and fcm modules were then defined as a function of
the start_phase constant. Similarly, the initial sample (y), pre-sample (pre_y), and mvs values were
initialized to be the value of the source waveform at the time given by start_phase.

mvs module ...

timeout IN { t: TIME | start_phase <= t AND t < start_phase + mvs_period };
mvs = waveform(start_phase);

fcm module ...

timeout IN { t: TIME | start_phase <= t AND t < start_phase + fcm_period };
y = waveform(start_phase);
pre_y = waveform(start_phase);
With this modification, the depth of model exploration and the search time can be greatly reduced, as the
depth of exploration can be reduced to 4. This value accommodates the two steps incurred from the
asynchronous composition of (clock || source) and two discrete steps for 'system' 16 .

4.5.1 Proving Agreement with Faults 

To support the proof of agreement using the timeout automata model, additional auxiliary lemmas were
required. The following lemmas constrain the values of the samples on channels 1 and 3 to be a function
of the source waveform at the time of reference.

y_invar1: LEMMA
  full |- G(fcm_timeout1 >= fcm_period => y1 = waveform(fcm_timeout1 - fcm_period - n1 * plant_period));
y_invar3: LEMMA
  full |- G(fcm_timeout3 >= fcm_period => y3 = waveform(fcm_timeout3 - fcm_period - n3 * plant_period));

Similarly, the previous samples of channels 1 and 3 are also constrained.

% depth 1, lemma: y_invar1
pre_y_invar1: LEMMA
  full |- G(fcm_timeout1 >= 2 * fcm_period => pre_y1 = waveform(fcm_timeout1 - 2 * fcm_period - pre_n1 * plant_period));

% depth 1, lemma: y_invar3
pre_y_invar3: LEMMA
  full |- G(fcm_timeout3 >= 2 * fcm_period => pre_y3 = waveform(fcm_timeout3 - 2 * fcm_period - pre_n3 * plant_period));


The initial values of the sample and pre-sample values are also constrained.


y_init1: LEMMA
  full |- G(fcm_timeout1 < fcm_period => y1 = 0);
y_init3: LEMMA
  full |- G(fcm_timeout3 < fcm_period => y3 = 0);


pre_y_init1: LEMMA
  full |- G(fcm_timeout1 < 2 * fcm_period => pre_y1 = 0);
pre_y_init3: LEMMA
  full |- G(fcm_timeout3 < 2 * fcm_period => pre_y3 = 0);


Each of the above invariants is provable using an induction depth of 1 . 

Additionally the lemmas constraining the bounds on the mvs values were revised as shown below 

mvs_bounds1: LEMMA full |- G((z11 <= mvs1 AND mvs1 <= z13) OR (z13 <= mvs1 AND mvs1 <= z11));
mvs_bounds2: LEMMA full |- G((z21 <= mvs2 AND mvs2 <= z23) OR (z23 <= mvs2 AND mvs2 <= z21));


These can be proved using an induction depth of 2 in conjunction with the fcm_timeout auxiliary 
lemmas presented in the previous section. 

Agreement can then be proved at an induction depth of 1 using all of the above lemmas in 
conjunction with the fcm_timeout lemmas of the previous section. 


16 However, it is emphasized that this depth assumes the harmonic relationship among the plant,
mvs and fcm periods. If this property is not valid, this depth may need to be revised and increased.
4.6 Run Scripts and Model Source Files

Source files and run scripts for the models described in the previous sections are posted at the NASA 
DASHlink site AFCS - Distributed Systems (https://c3.nasa.gov/dashlink/projects/79/). 


Section          Model                      Run File
4.1.1-4.1.7      mvs.sal                    run_mvs.sh
4.4.1-4.4.7      mvs_with_timoutl.sal       run_mvs_with_timeoutl.sh
4.5.1            mvs_with_timeouts3.sal     run_mvs_with_timeouts3.sal


5 Mathematical Analysis of Mid-Value-Selection

This section presents a mathematical analysis of the actuation force-fight, with the intention of formally
bounding the level of actuation agreement. Figure 6 illustrates three 1 Hz sinusoidal CM outputs where
the third channel drops off when time equals 3.0 seconds. In this example $c_1 = 1$, $c_2 = 0.9$,
$c_3 = 0.75$, $\varphi_1 = 0$, $\varphi_2 = 0.1$, $\varphi_3 = 0.2$.



Let $c_i \sin(x + \varphi_i)$ be the output from the $i$th channel, where $0 < c_i < 1$, $0 < \varphi_i < 1$. An
intersection between two channels will occur when $c_i \sin(x + \varphi_i) = c_j \sin(x + \varphi_j)$. Using standard
trigonometric identities, both sides of the equation can be expanded to

$$c_i \sin(\varphi_i)\cos(x) + c_i \cos(\varphi_i)\sin(x) = c_j \sin(\varphi_j)\cos(x) + c_j \cos(\varphi_j)\sin(x) \qquad (1)$$

Two special cases emerge. In the first, where $c_i = c_j$, $\varphi_i = \varphi_j$, the commands are identical and intersect
everywhere. In the other, where $c_i \neq c_j$, $\varphi_i = \varphi_j$, the commands intersect at $n\pi$, $n = 0, 1, 2, \ldots$ with a zero
value. For all other cases, we can simplify (1),

$$\sin(x)\,\bigl(c_i \cos(\varphi_i) - c_j \cos(\varphi_j)\bigr) = \cos(x)\,\bigl(c_j \sin(\varphi_j) - c_i \sin(\varphi_i)\bigr),$$

and find the points of intersection at

$$x = \tan^{-1}\!\left(\frac{c_j \sin(\varphi_j) - c_i \sin(\varphi_i)}{c_i \cos(\varphi_i) - c_j \cos(\varphi_j)}\right) + n\pi, \quad n = 0, 1, 2, \ldots \qquad (2)$$

Note, these values are in radians and need to be converted back to time by dividing the results by $2\pi$.
From (2) we have three sets of intersections, in seconds:

$$\{x_{12}\} = \{0.1744 + n/2,\ n = 0, 1, 2, \ldots\}$$
$$\{x_{13}\} = \{0.1191 + n/2,\ n = 0, 1, 2, \ldots\}$$
$$\{x_{23}\} = \{0.0566 + n/2,\ n = 0, 1, 2, \ldots\}$$
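The intersection times can be reproduced numerically with the Python sketch below. It assumes, as an interpretation that matches the quoted values, that the phases are given in seconds of a 1 Hz signal, so that channel i is $c_i \sin(2\pi(t + \varphi_i))$ and each phase is converted to radians by multiplying by $2\pi$. The function name is hypothetical.

```python
import math


def first_intersection_seconds(ci: float, phi_i: float, cj: float, phi_j: float) -> float:
    """First crossing, in seconds, of ci*sin(2*pi*(t+phi_i)) and cj*sin(2*pi*(t+phi_j)).

    Implements the principal value (n = 0) of equation (2). The special cases with
    equal phases make the denominator zero and are not handled here.
    """
    pi_rad = 2 * math.pi * phi_i   # phase of channel i in radians (assumed conversion)
    pj_rad = 2 * math.pi * phi_j   # phase of channel j in radians (assumed conversion)
    numerator = cj * math.sin(pj_rad) - ci * math.sin(pi_rad)
    denominator = ci * math.cos(pi_rad) - cj * math.cos(pj_rad)
    x = math.atan(numerator / denominator)   # radians
    return x / (2 * math.pi)                 # convert back to seconds


x12 = first_intersection_seconds(1.0, 0.0, 0.9, 0.1)
x13 = first_intersection_seconds(1.0, 0.0, 0.75, 0.2)
x23 = first_intersection_seconds(0.9, 0.1, 0.75, 0.2)
print(x12, x13, x23)
```

Further intersections follow every half second, corresponding to the $n\pi$ term in (2).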


The model for the mid-value select is to use the median of the three channels if all are available. If
only two channels are available, replace the missing channel with the previous mid-value and recalculate
the median. If only one channel is available, then use that channel without any modification.

The mid-value select algorithm, described above, is implemented in MATLAB notation as:


function mv=midval(y1,y2,y3)
    if (y1 <= y2)
        if (y2 <= y3)
            mv=y2;
        elseif (y1 <= y3)
            mv=y3;
        else
            mv=y1;
        end
    else
        if (y1 <= y3)
            mv=y1;
        elseif (y2 <= y3)
            mv=y3;
        else
            mv=y2;
        end
    end
end


It is possible to bound this disagreement in closed form. The main disagreement is due to the
difference between sinusoids. Recall, $c_i \sin(x + \varphi_i)$ is the output from the $i$th channel, where
$0 < c_i < 1$, $0 < \varphi_i < 1$. The difference between these channels is

$$c_i \sin(x + \varphi_i) - c_j \sin(x + \varphi_j) \qquad (3)$$

Using the same identity as above, (3) expands to

$$c_i \sin(\varphi_i)\cos(x) + c_i \cos(\varphi_i)\sin(x) - c_j \sin(\varphi_j)\cos(x) - c_j \cos(\varphi_j)\sin(x) \qquad (4)$$

which simplifies to

$$d_1 \sin(x) + d_2 \cos(x), \qquad d_1 = c_i \cos(\varphi_i) - c_j \cos(\varphi_j), \quad d_2 = c_i \sin(\varphi_i) - c_j \sin(\varphi_j) \qquad (5)$$

We can write (5) in the form

$$A \sin(x + \alpha), \qquad (6)$$

where $A\cos(\alpha) = d_1$, $A\sin(\alpha) = d_2$. But $A^2 = A^2\cos^2(\alpha) + A^2\sin^2(\alpha) = d_1^2 + d_2^2$. So,

$$A = \sqrt{d_1^2 + d_2^2} = \sqrt{\bigl(c_i \cos(\varphi_i) - c_j \cos(\varphi_j)\bigr)^2 + \bigl(c_i \sin(\varphi_i) - c_j \sin(\varphi_j)\bigr)^2}, \qquad (7)$$

$$A = \sqrt{c_i^2 + c_j^2 - 2 c_i c_j \bigl(\cos(\varphi_i)\cos(\varphi_j) + \sin(\varphi_i)\sin(\varphi_j)\bigr)}$$



The phase shift, $\alpha$, can be calculated as 



For the example in the previous figures, $c_1 = 1$, $c_2 = 0.9$, $c_3 = 0.75$, $\varphi_1 = 0$, $\varphi_2 = 0.1$, $\varphi_3 = 0.2$. 
Note, the phase shift is expressed in seconds, not radians. 

$$A_{12} = 0.59, \quad \alpha_{12} = 0.42$$
$$A_{13} = 1.05, \quad \alpha_{13} = 0.37 \tag{9}$$
$$A_{23} = 0.53, \quad \alpha_{23} = 0.31$$
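The amplitude values can be reproduced with a few lines of Python. The sketch below evaluates the closed-form amplitude, using the identity cos(a)cos(b) + sin(a)sin(b) = cos(a - b), and compares it with the largest sampled difference between two channels over one period. It assumes, as above, that the phase shifts are in seconds with a 1 second period. The names `amplitude` and `sampled_peak` are illustrative and do not appear in the report.

```python
import math

# Example parameters from the text. Assumption: phases in seconds, 1 s period.
C = {1: 1.0, 2: 0.9, 3: 0.75}
PHI_S = {1: 0.0, 2: 0.1, 3: 0.2}


def amplitude(i, j):
    """Closed-form amplitude A of c_i sin(x + phi_i) - c_j sin(x + phi_j)."""
    dphi = 2 * math.pi * (PHI_S[i] - PHI_S[j])
    return math.sqrt(C[i] ** 2 + C[j] ** 2 - 2 * C[i] * C[j] * math.cos(dphi))


def sampled_peak(i, j, samples=10000):
    """Largest |difference| between channels i and j over one period, by sampling."""
    peak = 0.0
    for k in range(samples):
        x = 2 * math.pi * k / samples
        yi = C[i] * math.sin(x + 2 * math.pi * PHI_S[i])
        yj = C[j] * math.sin(x + 2 * math.pi * PHI_S[j])
        peak = max(peak, abs(yi - yj))
    return peak


for i, j in [(1, 2), (1, 3), (2, 3)]:
    print(f"A_{i}{j} = {amplitude(i, j):.2f}, sampled peak = {sampled_peak(i, j):.2f}")
```

The sampled peak never exceeds the closed-form amplitude, which is what makes A usable as a bound on the disagreement between any two channels.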

For the three FCM command outputs shown in Figure 7, the output of the mid-value select algorithm 
is the dark blue trace. The disagreement between the mid-value select and the individual channels can 
be calculated by simple differencing. This is illustrated in Figure 8. 


Figure 7 Mid-value selection from 3 CMs, Channel 3 fails at time = 3.0 sec. 
(Plot title: "Mid-Value Select - Sinusoid, Phase Shift, Amplitude Difference".) 

Figure 8 Disagreement between CM channels and Mid-Value Select. 
(Plot title: "Mid-Value Select - Phase Shift, Amplitude Difference - Disagreement".) 
5.1 Inconsistent Omission Error Force Fight 

Figure 7 and the associated analysis allow us to calculate the force fight that results when a dual Actuator 
Sense Module (ASM) arrangement experiences an inconsistent omission error with an asynchronous, sinusoidal plant. 
As illustrated in Figure 9, three signals are provided to one mid-value select (MVS) algorithm, while only two 
signals are presented to the other ASM. The difference in the output of the MVS algorithms represents the 
force fight. 

For the example presented in Figure 7, the period before the 3 second mark represents the output of the 
ASM 1 MVS. The period after 3 seconds represents the output of the ASM 2 MVS. Note that in this case each MVS 
tracks a different signal for most of the time, so the force fight is bounded by the sinusoid defined in (6). The force 
fight is presented in Figure 10. 
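The bound can be illustrated with a small discrete-time simulation in Python. In this sketch ASM 1 receives all three channels, while ASM 2 never receives channel 3 and substitutes its own previous mid-value. The sample period, the initial previous mid-value of zero, the 1 second sinusoid period, and all function names are assumptions made for the sketch and are not taken from the report or its SAL models.

```python
import math

# Example parameters from the text. Assumption: phases in seconds, 1 s period.
C = (1.0, 0.9, 0.75)
PHI_S = (0.0, 0.1, 0.2)


def median3(a, b, c):
    """Mid-value (median) of three numbers."""
    return sorted((a, b, c))[1]


def force_fight(duration_s=2.0, dt=0.001):
    """Largest |ASM 1 MVS - ASM 2 MVS| when ASM 2 never receives channel 3."""
    prev_mv2 = 0.0  # assumed initial "previous mid-value" held by ASM 2
    worst = 0.0
    for k in range(int(round(duration_s / dt))):
        t = k * dt
        y = [c * math.sin(2 * math.pi * (t + p)) for c, p in zip(C, PHI_S)]
        mv1 = median3(y[0], y[1], y[2])      # ASM 1 sees all three channels
        mv2 = median3(y[0], y[1], prev_mv2)  # ASM 2 replaces channel 3 with its previous output
        prev_mv2 = mv2
        worst = max(worst, abs(mv1 - mv2))
    return worst


# Closed-form amplitude of the difference between channels 1 and 2.
A12 = math.sqrt(C[0] ** 2 + C[1] ** 2
                - 2 * C[0] * C[1] * math.cos(2 * math.pi * (PHI_S[0] - PHI_S[1])))

print(f"max force fight = {force_fight():.3f}, A_12 = {A12:.3f}")
```

Both MVS outputs are medians that include channels 1 and 2, so each output lies between those two channel values. Their difference therefore cannot exceed the difference between channels 1 and 2, whose amplitude is A_12.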


Figure 9 Dual ACE Force Fight - Inconsistent Omission Error. 
(Diagram elements: channel outputs $c_1\sin(x + \varphi_1)$, $c_2\sin(x + \varphi_2)$, $c_3\sin(x + \varphi_3)$; ASM 1 MVS; ASM 2 MVS.) 


Figure 10 Force Fight - Inconsistent Omission Error at t = 3.0 sec. 
(Plot title: "Mid-Value Select - Force Fight".) 

In the earlier SAL analysis a triangle wave command was used in lieu of a sine wave. Figure 11 
presents the equal-amplitude commands used in the SAL study. Figure 12 presents the force fight when 
Channel 1, 2, or 3 fails. 
Figure 11 Triangle Wave used in SAL Analysis. 

Figure 12 Force Fight from Triangle CM Commands - Single Channel Inconsistent Omissions. 
(Three panels, each titled "Mid-Value Select - Force Fight" with Time on the horizontal axis: Channel 1 Fail, Channel 2 Fail, Channel 3 Fail.) 
6 Conclusions and Future Work 


The Phase 2 case-study architectures [1] present a number of interesting challenges with regard to formal 
verification. Originally, it was our intention to form an integrated model that would support the 
verification of the agreement properties within the context of the plant model and control system logic. 
Our hope was that the SAL and Hybrid SAL tool chains would support the model capture and formal 
verification. However, the lessons learned from the open-loop simplified model indicate that such a 
strategy may not be the best option and that the SAL tool chain may not be the best vehicle to develop the 
proof. The effort and time required to manually develop the auxiliary lemmas to support the required 
proof were significant. Although automated invariant generation methods do exist, for example [15], these 
methods have not yet been implemented within SAL. We further conjecture that the invariant used by the 
mvs model may be beyond what these automated methods can find because auxiliary variables are 
required. Given this experience, at the time of writing we believe that it may be preferable to use SAL as a 
debugging tool and use PVS or related technologies to formalize the proof arguments. In addition, the visual 
results from MATLAB allowed us to develop an intuition that was then used to form a closed-form 
solution to the force fight for both sinusoidal and triangle CM command waveforms. When learning new 
technologies, such as SAL, it is easy to get lost in the intricacies of the new tool and language, which may 
cause simpler solutions to be overlooked. 

Given the above, it is our intuition that an integrated model may not be the best approach and that 
attacking the problem in stages may be preferable. For example, SAL or Hybrid SAL could be used to characterize 
and potentially prove the control feedback characteristics (e.g., maximum rate of change), and 
these abstracted characteristics could then be used to formally investigate agreement in a separate module. At the 
current time, the ability to reason about and verify complex agreement strategies that incorporate cross-lane 
equalization and mode consolidation within the control logic is also uncertain. This will be an area of 
focus in Year 3 of the work. 

The work performed to date, focusing on the output agreement, has proved to be very educational 
with respect to our understanding of the distributed agreement properties under inconsistent omissive and 
Byzantine fault scenarios. We believe that the alternative behavior and properties of the BRAIN 3.0 
protocol are also an interesting contribution to such architectures, supporting a discussion of agreement 
without the asynchronous vs. synchronous system philosophical discussions. 

In upcoming work, we intend to focus on the asynchronous architectures. We intend to refine the 
techniques to develop an integrated argument. We further intend to augment the model with 
representative lane-equalization, input selection, and asymmetric fault management logic. We also intend 
to explore the system validation activities performed on similar real-world systems to assess the applicability 
and feasibility of applying the analysis developed herein. We will also investigate the feasibility and 
potential benefit that can be derived from tests generated from the formal model abstractions. 

We further intend to extend this work to investigate more elaborate equalization strategies such as 
[14], and to explore the issues of multi-rate systems [15]. 

Finally, we intend to characterize the implications of the asynchronous control architecture and to 
contrast the bandwidth and CPU efficiency of our three case-study architectures; for example, to formally 
evaluate the computation required for equivalent levels of agreement. 